Generalized Linear Models

Key Points

  1. Generalized linear models let us use the machinery of linear models with outcome distributions beyond the Gaussian.

  2. Logistic regression is for binary outcome variables (e.g., closed/open, broke/working, dead/alive) and we are predicting the probability of the outcome.

  3. Our coefficients might take on different meanings when we use a different distribution.

  4. Predicted probabilities are likely the easiest way to interpret some models.

Generalized Linear Models

A GLM tells you exactly what it is right in the title.

But, how is it generalized?

Distributions…

Remember how linear models really enjoy the whole Gaussian distribution scene?

The essential form of the linear model can be expressed as follows:

\[y \sim Normal(\mu,\sigma) \\ \mu = \alpha + \beta X\]

Not all data follow a Gaussian distribution. Instead, we often find some other member of the exponential family of distributions.

So, we need a way to incorporate different distributions of the DV into our model.

Distributions cannot do it alone!

From a theoretical perspective, link functions are tricky to get your head around.

From a conceptual perspective, all they are doing is allowing the linear predictor to “link” to a distribution function’s mean.

At the end of the day, these link functions put the (often bounded) mean of the outcome onto an unbounded continuous scale.

The take-away here is that the link function describes how the mean is generated from the predictors.
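As a concrete sketch, R ships with the logit link and its inverse as `qlogis()` and `plogis()`: the link maps a bounded mean (a probability) onto the whole real line, and the inverse link maps it back. We will meet the logit again with logistic regression below.

```r
# A link function maps a bounded mean onto an unbounded scale.
# qlogis() is the logit link; plogis() is its inverse.
p <- c(.01, .25, .5, .75, .99)

logOdds <- qlogis(p)   # log(p / (1 - p)); unbounded, with p = .5 mapping to 0

plogis(logOdds)        # the inverse link recovers the original probabilities
```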

Linear Regression

Linear regression is not new to us.

A linear regression deals with a dependent variable of real numbers ranging from \(-\infty\) to \(\infty\).

It is the most vanilla within the GLM.

How do you find a mean from a normal distribution?

In R


crimeLink <- "http://nd.edu/~sberry5/data/crimeScore.csv"

crimeScore <- data.table::fread(crimeLink)

lmTest <- glm(SSL_SCORE ~ WEAPONS_ARR_CNT, 
              data = crimeScore, 
              family = gaussian)

summary(lmTest)

Call:
glm(formula = SSL_SCORE ~ WEAPONS_ARR_CNT, family = gaussian, 
    data = crimeScore)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-263.971   -29.971     0.029    30.029   183.029  

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     300.1126     1.0112  296.79   <2e-16 ***
WEAPONS_ARR_CNT  16.8585     0.7757   21.73   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 2814.148)

    Null deviance: 55206149  on 19146  degrees of freedom
Residual deviance: 53876862  on 19145  degrees of freedom
  (379537 observations deleted due to missingness)
AIC: 206414

Number of Fisher Scoring iterations: 2

We see a lot of the same information, but we have a new model fit statistic to look at: deviance.

Deviance is badness: lower is better. The null deviance tells us how well our dependent variable is predicted by a model with only the intercept – which is exactly what a null model is.

Our model, captured in the residual deviance, showed a marked decrease in deviance with very little lost in degrees of freedom.

We can test it with the following:


pchisq(lmTest$null.deviance - lmTest$deviance, 
       df = lmTest$df.null - lmTest$df.residual, 
       lower.tail = FALSE)

[1] 0

We are testing a hypothesis that the difference between our null and model deviance is \(\chi^2\) distributed. By specifying that our lower tail is false, we are specifically testing a proportion of the distribution that is higher than the value that we found. This significant p-value tells us that our model is certainly better than a model without predictors!

As a final point about our output, the value of AIC is not really interpretable – it is used only for model comparison.

We can also produce our confidence intervals:


confint(lmTest)

                    2.5 %    97.5 %
(Intercept)     298.13065 302.09447
WEAPONS_ARR_CNT  15.33824  18.37886
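Because the Gaussian family with its identity link is just ordinary least squares, glm() and lm() produce identical estimates. A quick check using R's built-in mtcars data (a stand-in here, since it ships with R):

```r
# glm() with family = gaussian is plain linear regression
lmFit  <- lm(mpg ~ wt, data = mtcars)
glmFit <- glm(mpg ~ wt, data = mtcars, family = gaussian)

coef(lmFit)

coef(glmFit)  # identical point estimates
```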

Logistic Regression

Logistic regression is substantially different from linear regression.

Instead of that nice continuous DV, we are dealing with a binomially distributed DV.

Our response takes the form of a binary variable.

We don’t have a \(\mu\) or \(\sigma^2\) to identify the shape of the distribution; instead we have p and n, where p is a probability and n is the number of trials.

We tend to talk about p with regard to the probability of a specific event happening (heads, wins, defaulting, etc.).

Let’s see how the binomial distribution looks with 100 trials and probabilities of .25, .5, and .75:


library(ggplot2)

library(dplyr)

binom.25 <- dbinom(1:100, size = 100, prob = .25)

binom.5 <- dbinom(1:100, size = 100, prob = .5)

binom.75 <- dbinom(1:100, size = 100 , prob = .75)

as.data.frame(rbind(binom.25, binom.5, binom.75)) %>% 
  mutate(prob = row.names(.)) %>% 
  tidyr::gather(., "key", "value", -prob) %>% 
  mutate(key = as.numeric(gsub("V", "", key))) %>% 
  ggplot(., aes(x = key, y = value, color = prob)) + 
  geom_line() + 
  scale_color_brewer(type = "qual", palette = "Dark2") + 
  theme_minimal()

If we examine the line for a probability of .5 (the orange line titled “binom.5”), we see that it is centered over 50 – we have the highest probability of encountering 50 successes. If we run 100 trials with a 50/50 outcome, we expect success in about half of them, with the probability falling off for counts further from 50. Shifting our attention to a .75 probability of success, we see that the density sits over 75.

Since we are dealing with a number of trials, it is worth noting that the binomial distribution is a discrete distribution. If you have any interest in knowing the probability of a given number of successes under the binomial distribution, we can use the following formula:

\[P(x) = \frac{n!}{(n-x)!x!}p^xq^{n-x}\]
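As a quick sanity check, we can write that formula directly in R (choose(n, x) handles the factorial term) and compare it to the built-in dbinom():

```r
# Binomial pmf from the formula: choose(n, x) is n! / ((n - x)! x!)
binomProb <- function(x, n, p) {
  choose(n, x) * p^x * (1 - p)^(n - x)
}

binomProb(50, 100, .5)  # probability of exactly 50 successes in 100 trials

dbinom(50, 100, .5)     # R's built-in pmf returns the same value
```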

While we don’t need to dive too deep into finding those specific values for the binomial distribution, we can spend our time exploring how it looks in linear model space:

\[y \sim Binomial(n, p) \\ logit(p) = \alpha + \beta X\]

The logit function is defined as:

\[log\frac{p}{1-p}\]

We are literally just taking the log of the odds (the log odds becomes important later).

Now we can map this back to our model:

\[log\frac{p}{1-p} = \alpha + \beta X\]

And finally we can take the logit function and invert it (the inverse logit) to produce probabilities.

\[p = \frac{exp(\alpha + \beta X)}{1 + exp(\alpha + \beta X)}\]
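Writing the inverse logit out as a function makes the round trip explicit (R's built-in plogis() does the same job):

```r
# Inverse logit straight from the formula above
invLogit <- function(x) exp(x) / (1 + exp(x))

invLogit(c(-5, 0, 5))        # always returns values in (0, 1); 0 maps to .5

invLogit(log(.4 / (1 - .4))) # logit then inverse logit recovers .4
```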

We will see more about how this happens after playing with the following model:


library(data.table)

kickstarter <- fread("https://www.nd.edu/~sberry5/data/kickstarter.csv")

kickstarter <- kickstarter[state == "successful" | state == "failed", 
                           ][, state := ifelse(state == "successful", 1, 0)] 

# You can set the variable types individually:

# kickstarter$backers <- as.numeric(kickstarter$backers)
# kickstarter$goal <- as.numeric(kickstarter$goal)


# But the following is more efficient:
numericVars <- c("backers", "goal")

kickstarter[, (numericVars) := lapply(.SD, function(x) as.numeric(x)), 
            .SDcols = numericVars]

Let’s start with a visual breakdown of failure and success:


ggplot(kickstarter, aes(state)) +
  geom_bar() +
  theme_minimal()

And a table:


addmargins(table(kickstarter$state))

     0      1    Sum 
168221 113081 281302 

If we take the number of successes and divide by the total, we will get the probability of being funded:


113081 / 281302

[1] 0.4019915

The probability of being funded is 0.4019915.

Continuous Predictors

While just knowing the probability is great, it does not provide much of a model. So, we can add predictor variables to model their relationship with the outcome variable.


logTest <- glm(state ~ backers, data = kickstarter, 
              family = binomial)

summary(logTest)

Call:
glm(formula = state ~ backers, family = binomial, data = kickstarter)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-8.4904  -0.6800  -0.6408   0.7450   1.8353  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.4788736  0.0060867    -243   <2e-16 ***
backers      0.0264517  0.0001336     198   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 379089  on 281301  degrees of freedom
Residual deviance: 272617  on 281300  degrees of freedom
AIC: 272621

Number of Fisher Scoring iterations: 11

We are now dealing with log odds in the coefficients. We would say that for every unit increase in backers, the log odds of a campaign being funded increase by ~.03.

Let’s also visualize this and come back to it later!


ggplot(kickstarter) +
  geom_count(aes(backers, state)) + 
  theme_minimal()

Some Logistic Grounding

Log Odds

By default, R reports the regression coefficients of a logistic model as log odds.

Probability lies at the heart of all of this. We can look at the relationship between the probability, odds, and log odds.


probabilityList <- c(.001, .01, .15, .2, 
                    .25, .3, .35, .4, .45, 
                    .5, .55, .6, .65, .7, 
                    .75, .8, .85, .9)

We have a list of probability values (always between 0 and 1). Now, let’s write a function to convert them to odds. We will use the \(p / (1 - p)\) equation.


oddsConversion <- function(p) {
  res = p / (1 - p)
  return(res)
}

odds <- oddsConversion(probabilityList)

plot(probabilityList, odds)

Now, we can convert them to log odds:


plot(odds, log(odds))

We can clearly go back and forth among the three, but the main message here is that we took a bounded variable (a probability) and transformed it into unbounded continuous space.

The Intercept

The intercept offers the log odds of a campaign with 0 backers being funded – converting this back to a probability with the inverse logit, we get the following:


exp(coef(logTest)["(Intercept)"]) / (1 + exp(coef(logTest)["(Intercept)"]))

(Intercept) 
  0.1855976 

We are dealing with a pretty small probability (but certainly not 0) that a campaign with 0 backers could be successful.

The Predictor

Recall that the coefficient in log odds for the backers was .02645174.

Let’s start at the median value of backers:


median(kickstarter$backers)

[1] 15

Now, let’s solve our equation for that median value. This will produce the conditional logit.


medLogit <- logTest$coefficients["(Intercept)"] + 
  (logTest$coefficients["backers"] * median(kickstarter$backers))

names(medLogit) <- NULL

medLogit

[1] -1.082097

Let’s do the same thing for the next sequential value of backers:


medLogitPlus1 <- logTest$coefficients["(Intercept)"] + 
  (logTest$coefficients["backers"] * (median(kickstarter$backers) + 1))

names(medLogitPlus1) = NULL

medLogitPlus1

[1] -1.055646

Now we have two sequential conditional logits that we can subtract from each other:


backersCoef <- medLogitPlus1 - medLogit

backersCoef

[1] 0.02645174

This is exactly the coefficient that we got from the model. If we exponentiate the log odds, we are “unlogging” it to get the odds ratio that we saw earlier (just plain old odds at this point). This is sometimes referred to as the proportional change in odds.


exp(backersCoef)

[1] 1.026805

For every unit increase in backers, we have a ~2.7% increase in the odds that the campaign will be successful. These increases compound multiplicatively as backers stack up (see the plot below). When we look at the odds, anything above 1 favors moving from the “0” category to the “1” category, whereas anything below 1 favors staying at “0”. From a hypothesis perspective, this makes absolute sense – we would expect that having more backers contributes to a better chance of funding.

There are interesting issues at play here with regard to our predictor coefficients (what can be considered a relative effect) and the model’s effect as a whole on the probability (the absolute effect). In circumstances where the intercept is very large (essentially promising a success), the relative effect of a coefficient is practically meaningless. Similarly, very negative coefficients render the relative effects useless.

Putting It Together (Hopefully)

When we construct our linear model, we are trying to predict the response value – not the case with logistic regression. Instead we are trying to predict the probability of going from 0 to 1 (not funded to funded). So, normal plots that we might usually create with linear models will do us no good here.

First, we need to make our predictions and put them back into the data:


kickstarter$predictedProbs <- predict(logTest, 
                                      type = "response")

The predict function is how we get predictions from any model. Do note the type argument with the “response” specification; this returns predictions on the response scale (here, probabilities) rather than on the scale of the linear predictor.

Now, we can plot it:


ggplot(kickstarter[backers < 500], aes(backers, predictedProbs)) +
  geom_line(size = 1.5) +
  theme_minimal()

For our data, once you get above roughly 250 backers, the predicted probability of success is essentially 1.

If we wanted to apply new data to our model, we would also use the predict() function, but give it new data.

Let’s add one more continuous predictor into the model.


twoPredictors <- glm(state ~ backers + goal, data = kickstarter, 
              family = binomial)

summary(twoPredictors)

Call:
glm(formula = state ~ backers + goal, family = binomial, data = kickstarter)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-8.4904  -0.6313  -0.0472   0.3570   8.4904  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.960e-01  7.037e-03  -127.3   <2e-16 ***
backers      5.106e-02  2.344e-04   217.9   <2e-16 ***
goal        -2.053e-04  1.201e-06  -170.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 379089  on 281301  degrees of freedom
Residual deviance: 191135  on 281299  degrees of freedom
AIC: 191141

Number of Fisher Scoring iterations: 25

exp(twoPredictors$coefficients)

(Intercept)     backers        goal 
  0.4081851   1.0523893   0.9997948 

Let’s run that same model, but with some tweaks, and then do something cool!


library(plotly)

truncatedData <- kickstarter[backers >= median(backers) &
                               backers <= quantile(backers, .75) &
                               goal >= median(goal) &
                               goal <= quantile(goal, .75)]

summary(truncatedData)

       ID                name             category        
 Min.   :2.237e+05   Length:18833       Length:18833      
 1st Qu.:5.366e+08   Class :character   Class :character  
 Median :1.074e+09   Mode  :character   Mode  :character  
 Mean   :1.075e+09                                        
 3rd Qu.:1.611e+09                                        
 Max.   :2.147e+09                                        
                                                          
 main_category        currency           deadline        
 Length:18833       Length:18833       Length:18833      
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
      goal         launched           pledged         
 Min.   : 5000   Length:18833       Length:18833      
 1st Qu.: 5000   Class :character   Class :character  
 Median : 7500   Mode  :character   Mode  :character  
 Mean   : 8374                                        
 3rd Qu.:10000                                        
 Max.   :15000                                        
                                                      
     state           backers        country         
 Min.   :0.0000   Min.   :15.00   Length:18833      
 1st Qu.:0.0000   1st Qu.:22.00   Class :character  
 Median :0.0000   Median :33.00   Mode  :character  
 Mean   :0.3444   Mean   :34.68                     
 3rd Qu.:1.0000   3rd Qu.:47.00                     
 Max.   :1.0000   Max.   :62.00                     
                                                    
 usd pledged            V14                V15           
 Length:18833       Length:18833       Length:18833      
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
     V16                 V17        predictedProbs  
 Length:18833       Min.   : NA     Min.   :0.2531  
 Class :character   1st Qu.: NA     1st Qu.:0.2897  
 Mode  :character   Median : NA     Median :0.3530  
                    Mean   :NaN     Mean   :0.3672  
                    3rd Qu.: NA     3rd Qu.:0.4414  
                    Max.   : NA     Max.   :0.5402  
                    NA's   :18833                   

twoPredictors <- glm(state ~ backers + goal, 
                     data = truncatedData, 
                     family = binomial)

backers <- unique(truncatedData$backers)
goals <- unique(truncatedData$goal)

grid <- expand.grid(backers, 
                    goals)

newData <- setNames(data.frame(grid), c("backers", "goal"))
predictedVals <- predict(twoPredictors, 
                         newdata = newData, 
                         type = "response")

zMatrix <- matrix(predictedVals, 
            nrow = length(unique(newData$backers)), 
            ncol = length(unique(newData$goal)))

plot_ly() %>% 
  add_surface(x = ~goals, y = ~backers, z = ~zMatrix)

Categorical Predictors

Let’s pick a few categories to work on.


catData <- kickstarter[main_category == "Film & Video" |
           main_category == "Music" |
           main_category == "Games", 
           .(state, main_category)]

Let’s look at the crosstabs for those variables:


addmargins(table(catData))

     main_category
state Film & Video  Games  Music    Sum
  0          29653  13013  19193  61859
  1          21404   9385  21763  52552
  Sum        51057  22398  40956 114411

We can start to look at the odds for each category being successful:


filmOdds <- (21404 / 51057) / (29653 / 51057) # You can reduce this to 21404 / 29653

gamesOdds <- (9385 / 22398) / (13013 / 22398)

musicOdds <- (21763 / 40956) / (19193 / 40956)

filmOdds

[1] 0.7218157

gamesOdds

[1] 0.7212019

musicOdds

[1] 1.133903

And we can take each one back to a probability:


filmOdds / (1 + filmOdds)

[1] 0.4192177

gamesOdds / (1 + gamesOdds)

[1] 0.4190106

musicOdds / (1 + musicOdds)

[1] 0.5313751

We can take our probability formulation and divide it by 1 minus the probability to recover our odds:


filmOdds / (1 + filmOdds) / (1 - (filmOdds / (1 + filmOdds)))

[1] 0.7218157

musicOdds / (1 + musicOdds) / (1 - (musicOdds / (1 + musicOdds)))

[1] 1.133903

When we consider what our odds mean now, we can envision them as the probability of the outcome occurring divided by the probability that it does not happen.

We could also compare those two directly:


(21763 / 19193) / (21404 / 29653)

[1] 1.570904

So the odds that a music campaign will be funded are about 57% higher than the odds for a film.

And we could also compare film to games:


(9385 / 13013) / (21404 / 29653)

[1] 0.9991497

Here, the odds that a game will be funded are essentially identical to the odds for a film (about 0.1% lower).

Let’s run our logistic regression now:


categoryLogTest <- glm(state ~ main_category, 
                       data = catData, 
                       family = binomial)

summary(categoryLogTest)

Call:
glm(formula = state ~ main_category, family = binomial, data = catData)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.231  -1.042  -1.042   1.319   1.319  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -0.3259855  0.0089690 -36.346   <2e-16 ***
main_categoryGames -0.0008507  0.0162432  -0.052    0.958    
main_categoryMusic  0.4516511  0.0133602  33.806   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 157849  on 114410  degrees of freedom
Residual deviance: 156517  on 114408  degrees of freedom
AIC: 156523

Number of Fisher Scoring iterations: 4

exp(coef(categoryLogTest))

       (Intercept) main_categoryGames main_categoryMusic 
         0.7218157          0.9991497          1.5709038 

filmOdds

[1] 0.7218157

The exponentiated intercept (0.72) is the odds of the reference group – film – being funded, exactly matching the odds we computed by hand.

Cautions About Logistic Regression

Bivariate relationships are really important for logistic models.

Your model needs to see some dispersion of values over the bivariate tables.

Logistic regression requires a larger sample size than a linear regression does.

A Big “Gotcha”

\(R^2\) does not apply to a logistic regression.

There are several pseudo-\(R^2\) methods that can stand in for it; they are useful for comparing models but should not be read like an OLS \(R^2\).
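One common option, McFadden's pseudo-\(R^2\), is simply 1 minus the ratio of the model deviance to the null deviance. A minimal sketch on simulated data (the data here are invented for illustration):

```r
set.seed(1001)

# Simulated binary outcome that depends on x
x <- rnorm(500)
y <- rbinom(500, size = 1, prob = plogis(-1 + 2 * x))

fit <- glm(y ~ x, family = binomial)

# McFadden's pseudo-R^2: closer to 1 means more deviance explained
pseudoR2 <- 1 - (fit$deviance / fit$null.deviance)

pseudoR2
```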

Logistic Practice

Remember our original model had some issues with separation. Now is a good time to try to fix that issue. An easy place to start might be to tidy up the range of backers.

Determine some cut-off values for backers and build a new model. See if you can take care of the separation issue and if you can improve upon the coefficient.

Poisson Regression

Key Points

  1. Poisson regression is for count-based dependent variables (e.g., how many touchdowns does a team get, how many safety violations does a factory have, how many credit cards do you have under your name)

  2. If you have a lot of zeroes in your data, you might want to consider a zero-inflated Poisson.

  3. If your count variable does not follow a Poisson distribution, then you might want to use a negative binomial model.

The Poisson Distribution

The Poisson distribution is very similar to the binomial distribution, but has some key differences. The biggest difference is in its parameter: Poisson has a single parameter noted as \(\lambda\). This rate parameter is going to estimate the expected number of events during a time interval. This can be accidents in a year, pieces produced in a day, or hits during the course of a baseball season. We can find the rate by determining the number of events per interval, multiplied by the interval length.

\[\frac{\text{event}}{\text{interval}}*\text{interval length} \]

To put some numbers to that, if we have 1 accident per week in a factory and we observe a 28-day month, we would get the following rate:


(1 / 7) * 28

[1] 4

We expect 4 accidents over the course of that month.

Let’s see what that particular distribution might look like:


ggplot(data.frame(x = 0:20), aes(x)) +
  geom_col(aes(y = dpois(x, (1 / 7) * 28)), fill = "#ff5500") +
  theme_minimal()

We can also see what it looks like for different rates (some places might be safer than others):


ggplot() +
  geom_col(data = data.frame(x = 1:28, y = dpois(1:28, (1 / 7) * 28)),
           mapping = aes(x, y), width = .97, alpha = .25, fill = "red") +
  geom_col(data = data.frame(x = 1:28, y = dpois(1:28, (3 / 7) * 28)),
           mapping = aes(x, y), width = .97, alpha = .25, fill = "blue") +
  geom_col(data = data.frame(x = 1:28, y = dpois(1:28, (5 / 7) * 28)),
           mapping = aes(x, y), width = .97, alpha = .25, fill = "green") +
  theme_minimal()

The (Sometimes) Thin Line

This gets into an area where we need to think long and hard about our dependent variable and what it actually might be.


library(data.table)
library(vcd)

shroudData <- fread("https://www.nd.edu/~sberry5/data/shroudData.csv")

poissonTest <- goodfit(shroudData$shroudsProduced, type = "poisson")

summary(poissonTest)

     Goodness-of-fit test for poisson distribution

                      X^2 df  P(> X^2)
Likelihood Ratio 13.28809 14 0.5039751

This is a \(\chi^2\) to test if the distribution deviates from a Poisson. If we see a significant value, we would say that it deviates from the tested distribution. In this case, everything looks okay.

We can also plot that test using a hanging rootogram:


plot(poissonTest)

The bars are the observed counts and the red line/points are the fitted counts (i.e., how many would be expected). If a bar does not reach the 0 line, then the model would over-predict for that particular count; if the bar dips below the 0 line, the model under-predicts that count.

Dispersion

For models of this nature (our dependent variable is a count variable), we may have two different distributions with which to operate: the Poisson distribution or the negative binomial distribution.

Let’s check this out (it will be important later on!).


shroudData[, list(mean = mean(shroudsProduced), 
                  var = var(shroudsProduced)), 
           by = employeeCount]

    employeeCount     mean      var
 1:            27 5.421875 4.882688
 2:            17 5.144928 4.861040
 3:            20 5.612903 4.765732
 4:            22 5.052632 4.743860
 5:            24 5.276923 4.484615
 6:            30 8.250000 4.584507
 7:            21 5.111111 4.494523
 8:            19 5.298701 6.106972
 9:            28 8.053571 4.524351
10:            16 5.354430 4.795846
11:            29 7.725806 4.267848
12:            15 4.906250 3.260913
13:            23 5.534247 5.891172
14:            25 4.802198 4.627106
15:            18 4.763636 4.739394
16:            26 5.293103 5.895039

What is the purpose of this? We are checking the conditional means and variances. Why is this important? If the conditional variances are larger than the conditional means, we have overdispersion – and that is not good for our Poisson model, because it biases the standard errors downward, making effects look more precise than they really are.

As an overly simple check, we can also compare the mean and the variance of our outcome variable – they should be close to equal!


mean(shroudData$shroudsProduced)

[1] 5.684018

var(shroudData$shroudsProduced)

[1] 5.96405

Not terribly far off, but we will see if it becomes a problem later. You might be wondering why overdispersion is a problem – it goes back to the single parameter within the Poisson distribution. The normal distribution has a parameter for dealing with variance – \(\sigma\) – Poisson does not, so the assumption is that any variance would be equal to the mean.
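A quick simulation shows the difference between equidispersed and overdispersed counts (the parameter values are arbitrary):

```r
set.seed(1001)

# Poisson draws: the variance should sit right around the mean
poisDraws <- rpois(10000, lambda = 5)

mean(poisDraws)
var(poisDraws)   # roughly 5 as well

# Negative binomial draws with the same mean: variance well above the mean
nbDraws <- rnbinom(10000, mu = 5, size = 2)

mean(nbDraws)
var(nbDraws)     # roughly mu + mu^2 / size = 17.5
```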

The Model

Recall that every distribution has a link function (or several) that tends to work well for it. The Poisson distribution uses a log link function:

\[y \sim Poisson(\lambda) \\ \text{log}(\lambda) = \alpha + \beta X\]

Using the log link keeps the outcome positive (we cannot deal with negative counts). Logs, as they are prone to do, are going to tend towards an exponential relationship; just be sure that it makes sense over the entire range of your data.


poissonTest = glm(shroudsProduced ~ employeeCount,
                  data = shroudData,
                  family = poisson)

summary(poissonTest)

Call:
glm(formula = shroudsProduced ~ employeeCount, family = poisson, 
    data = shroudData)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.5346  -0.7647  -0.0033   0.5966   3.0804  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.084013   0.065469   16.56   <2e-16 ***
employeeCount 0.028771   0.002791   10.31   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1200.8  on 1094  degrees of freedom
Residual deviance: 1094.1  on 1093  degrees of freedom
AIC: 4926.3

Number of Fisher Scoring iterations: 4

exp(poissonTest$coefficients)

  (Intercept) employeeCount 
     2.956519      1.029189 

Important Note: We are going to interpret this almost the same as a linear regression. The slight wrinkle here, though, is that we are looking at log counts (remember that we specified the log link function). In other words, an increase of one employee leads to an expected log count increase of ~.029. Just like our logistic regression, we can exponentiate this to get 1.029189 – every employee we add gets us a ~3% increase in shrouds produced. Let’s see what this looks like in action:


shroudData = shroudData %>%
  mutate(predValues = predict(poissonTest, type = "response"))

ggplot(shroudData, aes(employeeCount, predValues)) +
  geom_count() +
  scale_size_area() +
  theme_minimal()


pchisq(poissonTest$deviance, poissonTest$df.residual, lower.tail = FALSE)

[1] 0.4848536

With everything coupled together, we have a meaningful coefficient, a clear plot, and adequate model fit. Therefore, we might conclude that there is a positive relationship between number of employees on shift and shrouds produced.
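One more note on that multiplicative scale: adding k employees multiplies the expected count by \(exp(\beta)^k\), so effects compound rather than add. Using the coefficient from the output above:

```r
# Coefficient for employeeCount from the model above
b <- 0.028771

# Ten extra employees multiply expected production by exp(b)^10 --
# about a 33% increase, not just 10 * 3%
exp(b * 10)

exp(b)^10  # identical, by the rules of exponents
```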

In addition to checking our data for overdispersion before running the model, we can also check after fitting it:


library(AER)

dispersiontest(poissonTest)

    Overdispersion test

data:  poissonTest
z = -1.3997, p-value = 0.9192
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion 
 0.9452052 

The dispersion value that we see returned (0.9452052 in our case) should be close to 1; a value well above 1 means that we have overdispersion. Our dispersion value, coupled with our high p-value, indicates that we fail to reject the null hypothesis of equidispersion.

We can also look back to our model results to compare our residual deviance to our residual deviance degrees of freedom; if our deviance is greater than our degrees of freedom, we might have an issue with overdispersion. Since we are just a bit over and our overdispersion tests do not indicate any huge issue, we can be relatively okay with our model. If we had some more extreme overdispersion, we would want to flip to a quasi-poisson distribution – our coefficients would not change, but we would have improved standard errors.
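To see what the quasi-Poisson family actually buys us, here is a sketch on simulated overdispersed counts (negative binomial draws; all parameter values are invented for illustration):

```r
set.seed(1001)

# Overdispersed counts: negative binomial draws around a log-linear mean
x <- rnorm(500)
y <- rnbinom(500, mu = exp(1 + .3 * x), size = 1.5)

poisFit  <- glm(y ~ x, family = poisson)
quasiFit <- glm(y ~ x, family = quasipoisson)

coef(poisFit)
coef(quasiFit)  # identical point estimates

summary(poisFit)$coefficients[, "Std. Error"]
summary(quasiFit)$coefficients[, "Std. Error"]  # inflated to absorb the overdispersion
```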

Zero-inflated Poisson (ZIP)

Sometimes we have a seeming abundance of zero values within our data. We can have employees with zero absence periods, lines with zero quality failures, and days without safety issues. What is the process that generated the zeros? Are they coming from our count model (“true” zeroes) or something else (some random process)? This is where zero-inflated models become important. ZIP models are mixture models. We are not going to dive too deeply into this, but all you need to know is that a mixture model contains a “mixture” of different distributions:

\[y \sim \text{ZIP}(p,\lambda) \\ \text{logit} = \alpha_p + \beta_pX \\ \text{log} = \alpha_\lambda + \beta_\lambda X\]

This just formalizes the fact that our y can be generated from multiple processes.
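A quick simulation makes the mixture concrete: one process decides whether an observation is a structural zero, and a separate Poisson process generates counts for everyone else (the parameter values here are invented):

```r
set.seed(1001)

n <- 1000

# Process 1: a logistic-style gate -- is this a structural zero? (p = .3)
structuralZero <- rbinom(n, size = 1, prob = .3)

# Process 2: the count process for observations that clear the gate
counts <- rpois(n, lambda = 4)

# The observed outcome mixes the two processes
y <- ifelse(structuralZero == 1, 0, counts)

# Far more zeroes than a plain Poisson(4) would ever produce
mean(y == 0)

dpois(0, 4)  # under 2%
```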


redlights <- fread("https://www.nd.edu/~sberry5/data/redlights.csv")

poissonRedlight <- glm(citation_count ~ as.factor(camera_installed_ny),
            data = redlights,
            family = poisson)

summary(poissonRedlight)

Call:
glm(formula = citation_count ~ as.factor(camera_installed_ny), 
    family = poisson, data = redlights)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.6361  -0.7131  -0.0428   0.6445   2.5624  

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)
(Intercept)                      2.29289    0.01455  157.59   <2e-16
as.factor(camera_installed_ny)1 -0.88529    0.02650  -33.41   <2e-16
                                   
(Intercept)                     ***
as.factor(camera_installed_ny)1 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2280.2  on 975  degrees of freedom
Residual deviance: 1059.8  on 974  degrees of freedom
AIC: 4581.8

Number of Fisher Scoring iterations: 4

With this output, we are comparing citation counts against intersections without a camera and those with a camera.

We see that our coefficient is -.88529 – this means that intersections with a camera have log counts .88529 lower than intersections without a camera.

We can also exponentiate that value to get an incidence rate:


exp(poissonRedlight$coefficients)

                    (Intercept) as.factor(camera_installed_ny)1 
                      9.9035639                       0.4125961 

Let’s see what those mean with an example:


tapply(X = redlights$citation_count, 
       INDEX = redlights$camera_installed_ny, 
       FUN = mean) # Old-school group_by and summarize!

       0        1 
9.903564 4.086172 

4.086172/9.903564

[1] 0.4125961

Dividing the mean of the target group by the mean of the reference group gives us the incidence rate (i.e., for every one time someone runs a red light at an intersection without a camera, it happens .41 times at an intersection with a camera).

If we take a look at citation_count’s distribution, we will see more than a few 0’s.


hist(redlights$citation_count)

For our redlight data, we saw that having a camera present had an effect on citations, but would it cause 0 citations? Or might there be something else contributing to the 0’s (e.g., no cars going through that intersection due to construction, no police nearby)? If there are no cars going through the intersection due to construction, is there even a chance of obtaining a non-zero response?


library(pscl)

zipTest <- zeroinfl(citation_count ~ as.factor(camera_installed_ny),
                   dist = "poisson", data = redlights)

# Important note: If you want your logistic model to differ from your
# poisson model, you would do this:

zipTest <- zeroinfl(citation_count ~ 
                      as.factor(camera_installed_ny) | 
                      as.factor(camera_installed_ny),
                    dist = "poisson", data = redlights)

summary(zipTest)

Call:
zeroinfl(formula = citation_count ~ as.factor(camera_installed_ny) | 
    as.factor(camera_installed_ny), data = redlights, dist = "poisson")

Pearson residuals:
     Min       1Q   Median       3Q      Max 
-2.82923 -0.68433 -0.04265  0.66616  2.92554 

Count model coefficients (poisson with log link):
                                Estimate Std. Error z value Pr(>|z|)
(Intercept)                      2.29290    0.01455  157.59   <2e-16
as.factor(camera_installed_ny)1 -0.88528    0.02650  -33.41   <2e-16
                                   
(Intercept)                     ***
as.factor(camera_installed_ny)1 ***

Zero-inflation model coefficients (binomial with logit link):
                                Estimate Std. Error z value Pr(>|z|)
(Intercept)                       -27.82   50417.30  -0.001        1
as.factor(camera_installed_ny)1    12.72   50417.48   0.000        1
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 20 
Log-likelihood: -2289 on 4 Df

Here we have two sets of coefficients: one for our count model (our actual Poisson regression) and one attempting to account for the excessive 0s. Our count model barely changes, but the zero-inflation coefficients have enormous standard errors and are nowhere near significant – the zero-inflation model is not accounting for the 0s within our data. This is a clear indication that something else might be contributing to the number of 0s that we see.

Negative Binomial

Remember that whole issue with our conditional means and variances? If we had problems with those means and variances, we would need to abandon our Poisson distribution in favor of the negative binomial. The Poisson distribution assumes the mean and variance are equal – the negative binomial relaxes that constraint by adding a dispersion parameter, allowing the variance to exceed the mean.

Remember this:


redlights[, 
          list(mean = mean(citation_count), 
               var = var(citation_count)), 
          by = camera_installed_ny] 

   camera_installed_ny     mean       var
1:                   0 9.903564 11.083118
2:                   1 4.086172  3.994567

Those look like the start of problems. Let’s check our whole sample now:


mean(redlights$citation_count)

[1] 6.929303

var(redlights$citation_count)

[1] 15.91602

There is clearly a problem here!


library(MASS)

nbTest <- glm.nb(citation_count ~ as.factor(camera_installed_ny), 
                 data = redlights)

summary(nbTest)

Call:
glm.nb(formula = citation_count ~ as.factor(camera_installed_ny), 
    data = redlights, init.theta = 110.2437215, link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.5418  -0.6853  -0.0420   0.6157   2.4277  

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)
(Intercept)                      2.29289    0.01519  150.96   <2e-16
as.factor(camera_installed_ny)1 -0.88529    0.02719  -32.56   <2e-16
                                   
(Intercept)                     ***
as.factor(camera_installed_ny)1 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(110.2437) family taken to be 1)

    Null deviance: 2148.6  on 975  degrees of freedom
Residual deviance:  998.6  on 974  degrees of freedom
AIC: 4581.6

Number of Fisher Scoring iterations: 1

              Theta:  110.2 
          Std. Err.:  79.2 

 2 x log-likelihood:  -4575.634 

The interpretation of our negative binomial is exactly the same as our Poisson model – we have only relaxed the assumptions of our distribution! You might notice our model fits better (even if slightly so) by using the negative binomial.
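Since both models are fit by maximum likelihood, we can put that fit comparison side by side with `AIC()`:

```r
# Compare the Poisson and negative binomial fits; lower AIC is better.
# From the summaries above, the Poisson model reported an AIC of 4581.8
# and the negative binomial an AIC of 4581.6 -- a small edge for the NB.
AIC(poissonRedlight, nbTest)
```

Note that the negative binomial carries one extra degree of freedom for its dispersion parameter (theta), so the AIC comparison already penalizes it for the added flexibility.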

Another Distribution Worth Noting

The beta distribution has many practical uses. It can be used to model variables bounded between 0 and 1 (but that never actually hit 0 or 1). If you have an outcome variable that is a proportion, it would be useful to keep the beta distribution in mind. We will also see the beta distribution in the context of Bayesian analysis, because we can use it as a prior for \(R^2\).
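As a minimal sketch of what that looks like in practice, the betareg package fits beta regressions with the same formula interface as `glm()`. The data and variable names below are entirely hypothetical:

```r
library(betareg)

# Hypothetical data: defect_prop is a proportion strictly between 0 and 1.
qualityData <- data.frame(
  defect_prop = c(0.02, 0.10, 0.05, 0.30, 0.15, 0.08),
  line_speed = c(10, 40, 20, 80, 50, 30)
)

# betareg() models the mean of a beta-distributed outcome,
# using a logit link by default.
betaTest <- betareg(defect_prop ~ line_speed, data = qualityData)

summary(betaTest)
```

Because of the logit link on the mean, the coefficients are interpreted much like logistic regression coefficients – on the log-odds scale of the proportion.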